In this demo, we use CyTOF data of PBMC samples from 4 patients, each containing between 500 and 1000 cells. For each sample, the expression of 10 cell surface and 14 signaling markers was measured before (REF) and upon BCR/FcR-XL stimulation (BCRXL) with B cell receptor/Fc receptor crosslinking for 30’, resulting in a total of 8 samples.
This data is stored as a flowSet object PBMC_fs, where each sample is stored in a separate flowFrame:
library(CATALYST)
## Warning: package 'S4Vectors' was built under R version 4.1.3
data(PBMC_fs)
PBMC_fs
## A flowSet with 8 experiments.
##
## column names(24): CD3(110:114)Dd CD45(In115)Dd ... HLA-DR(Yb174)Dd
## CD7(Yb176)Dd
Additionally, we load two metadaa tables: PBMC_md and PBMC_panel that contain metadata on each sample and marker, respectively:
data(PBMC_md, PBMC_panel)
PBMC_md
## file_name sample_id condition patient_id
## 1 PBMC_patient1_BCRXL.fcs BCRXL1 BCRXL Patient1
## 2 PBMC_patient1_Ref.fcs Ref1 Ref Patient1
## 3 PBMC_patient2_BCRXL.fcs BCRXL2 BCRXL Patient2
## 4 PBMC_patient2_Ref.fcs Ref2 Ref Patient2
## 5 PBMC_patient3_BCRXL.fcs BCRXL3 BCRXL Patient3
## 6 PBMC_patient3_Ref.fcs Ref3 Ref Patient3
## 7 PBMC_patient4_BCRXL.fcs BCRXL4 BCRXL Patient4
## 8 PBMC_patient4_Ref.fcs Ref4 Ref Patient4
head(PBMC_panel)
## fcs_colname antigen marker_class
## 1 CD3(110:114)Dd CD3 type
## 2 CD45(In115)Dd CD45 type
## 3 pNFkB(Nd142)Dd pNFkB state
## 4 pp38(Nd144)Dd pp38 state
## 5 CD4(Nd145)Dd CD4 type
## 6 CD20(Sm147)Dd CD20 type
SingleCellExperimentFrom the three data objects above, we construct an object of the SingleCellExperiment (SCE) class, which extends the SummarizedExperiment class with specialized methods for single-cell data:
sce <- prepData(PBMC_fs, PBMC_panel, PBMC_md)
In a SCE, assays contain matrix-like measurement data (e.g., transcript counts
in scRNA-seq, ion counts in CyTOF, fluorescent intensities in FACS), where rows = features (transcripts, genes, proteins) and columns = observations (single cells, samples):
# our SCE contains ion counts &
# expression-like transformed counts
assayNames(sce)
## [1] "counts" "exprs"
# we have 24 markers & about 5.5k cells
dim(sce)
## [1] 24 5428
rownames(sce)
## [1] "CD3" "CD45" "pNFkB" "pp38" "CD4" "CD20" "CD33" "pStat5"
## [9] "CD123" "pAkt" "pStat1" "pSHP2" "pZap70" "pStat3" "CD14" "pSlp76"
## [17] "pBtk" "pPlcg2" "pErk" "pLat" "IgM" "pS6" "HLA-DR" "CD7"
Two DataFrame objects, row/colData, accompany rows/columns and store feature/observation metadata:
head(rowData(sce))
## DataFrame with 6 rows and 3 columns
## channel_name marker_name marker_class
## <character> <character> <factor>
## CD3 CD3(110:114)Dd CD3 type
## CD45 CD45(In115)Dd CD45 type
## pNFkB pNFkB(Nd142)Dd pNFkB state
## pp38 pp38(Nd144)Dd pp38 state
## CD4 CD4(Nd145)Dd CD4 type
## CD20 CD20(Sm147)Dd CD20 type
head(colData(sce))
## DataFrame with 6 rows and 3 columns
## sample_id condition patient_id
## <factor> <factor> <factor>
## 1 BCRXL1 BCRXL Patient1
## 2 BCRXL1 BCRXL Patient1
## 3 BCRXL1 BCRXL Patient1
## 4 BCRXL1 BCRXL Patient1
## 5 BCRXL1 BCRXL Patient1
## 6 BCRXL1 BCRXL Patient1
Additionally, the metadata slot stores a list of experiment-wide metadata:
names(md <- metadata(sce))
## [1] "experiment_info"
head(md$experiment_info)
## sample_id condition patient_id n_cells
## 1 BCRXL1 BCRXL Patient1 528
## 2 Ref1 Ref Patient1 881
## 3 BCRXL2 BCRXL Patient2 665
## 4 Ref2 Ref Patient2 438
## 5 BCRXL3 BCRXL Patient3 563
## 6 Ref3 Ref Patient3 660
We can quickly visualize the distribution of cells across the 8 samples:
plotCounts(sce, group_by = "sample_id", color_by = "patient_id")
Next, we use the cluster() function to perform FlowSOM clustering and ConsensusClusterPlus metaclustering into xdim x ydim clusters, and 2 through maxK metaclusters:
sce <- cluster(sce, xdim = 10, ydim = 10, maxK = 20, verbose = FALSE)
All clusterings (resolutions 100, 2, 3, …, 20) are now available in the SCE. We can acess the cluster assignments for a specific clustering k via cluster_ids(., k), or view a table of all available clusterings with cluster_codes(.):
# view available clusterings
names(codes <- cluster_codes(sce))
## [1] "som100" "meta2" "meta3" "meta4" "meta5" "meta6" "meta7" "meta8"
## [9] "meta9" "meta10" "meta11" "meta12" "meta13" "meta14" "meta15" "meta16"
## [17] "meta17" "meta18" "meta19" "meta20"
# tabulate cells for a specific resolution
table(cluster_ids(sce, "meta8"))
##
## 1 2 3 4 5 6 7 8
## 708 1195 1293 449 537 605 297 344
A neat way to visualize how the different resolutions are hierarchically linked is with the clustree package:
library(clustree)
clustree(codes, prefix = "meta")
The methodology presented here is based on a dichotomy of markers into “type” markers (used for clustering cells into subpopulations) and “state” markers (tested for differences in expression across conditions). The set of type and state markers defined in the SCE can be viewed via type/state_markers(.):
type_markers(sce)
## [1] "CD3" "CD45" "CD4" "CD20" "CD33" "CD123" "CD14" "IgM"
## [9] "HLA-DR" "CD7"
state_markers(sce)
## [1] "pNFkB" "pp38" "pStat5" "pAkt" "pStat1" "pSHP2" "pZap70" "pStat3"
## [9] "pSlp76" "pBtk" "pPlcg2" "pErk" "pLat" "pS6"
By default, cluster() has used the type markers to group cells into subpopulations. Thus, we could use the expression of these markers to assign biologically meaningful labels to our clusters:
plotExprHeatmap(sce, features = "type", by = "cluster_id")
One of the most popular ways to visualize single cell data is via dimensionality reduction (DR), where each cell is projected into a lower, usually two-dimensional, space. Here, we compute a nonlinear dimensionality reduction technique, uniform manifold approximation and projection (UMAP), using the runDR() function:
sce <- runDR(sce, dr = "UMAP")
We can visualize the above two-dimensional embedding of cells using plotDR() and, for example, color cells by their cluster assignment, patient ID, and experimental treatment group:
plotDR(sce, color_by = "meta12")
plotDR(sce, color_by = "patient_id")
plotDR(sce, color_by = "condition")
We can also color cells by the expression of specific markers. And furthermore split the plot by another variable of interest, say treatment:
plotDR(sce, color_by = c("pS6", "pBtk", "pp38"), facet_by = "condition")
pbMDS() renders a multi-dimensional scaling (MDS) plot on median expression values. Such a plot will give a sense of similarities between samples in an unsupervised way and of key difference in expression before conducting any formal testing:
pbMDS(sce, by = "sample_id")
pbMDS(sce, by = "both")
The diffcyt package implements methods for differential discover
in high-dimensional cytometry data
- differential state (DS) analysis to test for subpopulation-specific changes in expression across experimental conditions
- differential abundance (DA) analysis to test changes in relative abundance (frequency) of subpopulations across conditions
Without conducting any formal testing, we can visualize the expression of our 14 state markers (not used for clustering) across all samples to see whether there are visible changes across conditions:
plotMultiHeatmap(sce, hm1 = FALSE, scale = "never",
hm2 = state_markers(sce)[1:7])
plotMultiHeatmap(sce, hm1 = FALSE, scale = "never",
hm2 = state_markers(sce)[8:14])
To formally identify subpopulation-specific changes in expression across conditions, we first set up a model formula and contrast matrix, and then use the diffcyt() function for statistical testing:
library(diffcyt)
ei <- md$experiment_info
design <- createDesignMatrix(ei, "condition")
contrast <- createContrast(c(0, 1))
ds <- diffcyt(sce, ei, analysis_type = "DS", method_DS = "diffcyt-DS-limma",
design = design, contrast = contrast, clustering_to_use = "meta20")
diffcyt() returns a list of various objects. Differential testing results are stored in the first element’s rowData:
head(tbl <- rowData(ds$res))
## DataFrame with 6 rows and 9 columns
## cluster_id marker_id ID logFC AveExpr t p_val
## <factor> <factor> <character> <numeric> <numeric> <numeric> <numeric>
## 1 1 pNFkB 1 -1.47685 1.79315 -4.40364 0.001460778
## 2 2 pNFkB 2 -1.48163 1.54605 -5.81654 0.000197358
## 3 3 pNFkB 3 -1.02816 1.81292 -3.99371 0.002753439
## 4 4 pNFkB 4 -2.14277 2.42447 -5.55022 0.000281649
## 5 5 pNFkB 5 -2.26421 1.81401 -4.76644 0.000850179
## 6 6 pNFkB 6 -1.45570 1.75282 -5.09587 0.000528678
## p_adj B
## <numeric> <numeric>
## 1 0.01573146 -1.062223
## 2 0.00451360 0.892961
## 3 0.02513515 -2.114418
## 4 0.00525744 0.785925
## 5 0.01159953 -0.276018
## 6 0.00870764 -0.326383
Having extracted the above table, we can easily tabulate the number of marker-cluster combinations that are significant at a given FDR cutoff, say 5%:
fdr <- 0.05
table(tbl$p_adj < 0.05)
##
## FALSE TRUE
## 242 38
Lastly, we can visualize the top differential findings across all clusters:
plotDiffHeatmap(sce, tbl, top_n = 20)
So far, we’ve mostly looked at expression of markers across different samples, conditions and clusters. We could also investigate the relative abundance (frequency) of subpopulations across samples and treatment groups:
plotAbundances(sce, by = "sample_id")
plotAbundances(sce, by = "cluster_id")
Similar to the plots generated above, we can render a heatmap representation of these frequencies:
plotFreqHeatmap(sce)